Exploring Loki Query Syntax
Understand the Loki Query Syntax for querying the logs.
Overview of the Loki Query Syntax#
As we mentioned earlier, LogQL is a slightly modified subset of the well-known PromQL that many of us use daily. That doesn't mean we need to be PromQL experts to use it, but rather that there's a similarity some can leverage. Those with no prior knowledge should still have an easy time grasping it. It’s relatively simple.
That being said, teaching LogQL or PromQL is out of the scope of this course, so we suggest checking the documentation if that's something you need. Our goal is to evaluate whether Loki might be the right choice for your needs, so we’ll explore it briefly without going into details, especially not at the query-language level.
Let’s take a look at the last expression and compare it to PromQL used by Prometheus.
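For reference, the expression in question looked something like the snippet that follows. The exact job name depends on your cluster; `production/go-demo-9-go-demo-9` is an assumption based on the workload used later in this section.

```
{job="production/go-demo-9-go-demo-9"} != "GET request to /"
```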
First of all, there's no metric name we might be used to when working with Prometheus. Actually, our beloved some_metric{foo="bar"} is just a shorthand for {__name__="some_metric", foo="bar"}, so there isn't a really big difference there. We're selecting a log stream by specifying labels just as we do with metrics in Prometheus.
That part of the query is called the log stream selector. The usual equality and inequality operators (=, !=) are present, alongside their regex counterparts (=~, !~). What is notable in Loki is the job label. It's a convenience label that consists of a namespace and a replication controller name, where a replication controller can be a Deployment, a StatefulSet, or a DaemonSet. Basically, it's the name of the workload we want to investigate.
What's really different is the latter part (!= "GET request to /"), called a filter expression. The log exploration routine usually involves narrowing down the log stream to relevant parts.
In the good old days of Linux servers, we used to grep logs. For example, to get all problematic requests for example.com, we'd do something like the command that follows.
Note: Do not run the command that follows. It's meant to show the equivalent “old” way of dealing with logs.
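A minimal, self-contained sketch of that approach follows. The sample `access.log` content is fabricated here purely so the pipeline has something to work with.

```shell
# Fabricated sample data so the pipeline below has something to work with.
printf '%s\n' \
  'example.com GET /index.html 200' \
  'example.com GET /missing 404' \
  'other.org GET / 200' > access.log

# Dump the log, keep lines mentioning example.com,
# then drop the lines containing 200 (HTTP OK).
cat access.log | grep "example.com" | grep -v "200"
```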
That command would output the content of the access.log file and pipe it to the grep command, which would search for lines containing example.com and pipe the result further to yet another grep. The final grep would filter out all the lines containing 200 (the HTTP OK code). The equivalent LogQL query would look like the snippet that follows.
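Assuming the same log stream selector as before (the job name is illustrative), the query could be written as:

```
{job="production/go-demo-9-go-demo-9"} |= "example.com" != "200"
```

The `|=` filter keeps only the lines containing example.com, while `!=` drops those containing 200.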
Feel free to type that query in Grafana and execute it. The result should be all the entries containing example.com that do not include 200. In other words, that query returns all entries that mention example.com and contain errors (non-200 response codes).
If there are no results, that’s normal. It means that the app didn't produce error entries that also contain example.com. We’ll simulate errors soon.
Note: The query isn't really correct since response codes above 200 and below 400 are also not errors, but let’s not get picky about it.
Create a new dashboard using Grafana#
Up to now, we were executing log queries that are supposed to return log lines. A recent addition to LogQL is metric queries, which are more in line with PromQL: they calculate values based on log entries. For example, we might want to count log lines or calculate the rate at which they are produced. We can perform such tasks in the Explore mode, but we can also create real dashboards based on metric queries.
Let’s return to Grafana and create a new dashboard.
We should be presented with a screen with a prominent blue button saying "+ Add new panel." Click it.
Select the field next to the drop-down list with "Log labels" selected, type the query that follows, and execute it.
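The exact query isn't reproduced in this extract; a typical form of such a metric query, assuming Promtail's default labels, would be:

```
topk(10, sum(count_over_time({job=~".+"}[1m])) by (job))
```

`count_over_time` counts log lines per stream over one-minute windows, `sum by (job)` aggregates them per workload, and `topk(10, ...)` keeps the ten noisiest.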
If you're familiar with PromQL, you might have an idea of what we did. We retrieved the 10 noisiest workloads in our cluster, grouped by the job.
Note: You might see only the result on the far right of the graph. That’s normal because, by default, the graph only shows the last six hours, and you've likely had Loki running for a much shorter period.
That's quite noisy, so let’s exclude built-in components by excluding workloads from the kube-system namespace.
Please use the query that follows.
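A sketch of that query, extending the previous one with a `namespace` matcher (the label names are assumptions):

```
topk(10, sum(count_over_time({namespace!="kube-system", job=~".+"}[1m])) by (job))
```

The `job=~".+"` matcher remains because Loki requires at least one matcher that doesn't match empty values.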
Now we should see fewer jobs because those from the kube-system namespace are now excluded.
That was already quite useful and indistinguishable from the PromQL query. But we're working with log lines, so let’s try to utilize that.
Please use the query that follows.
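Since the original snippet isn't shown here, one plausible way to write it is to add a line filter before the range:

```
topk(10, sum(count_over_time({namespace!="kube-system", job=~".+"} |= "error" [1m])) by (job))
```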
This time, we excluded all workloads in kube-system (just like before), but we also filtered the results to only those containing the word error. In our case, that's only Promtail running in the monitoring namespace. It probably failed the first time it ran because it depends on the Loki server, which tends to take more time to boot.
That query combined features of both PromQL and LogQL. We used the topk() function from PromQL and the filter expression (|= "error") from LogQL.
Please visit the Log Query Language section of the documentation for more info.
More on Log queries#
Most of the time, we're not interested in application logs. Most of us tend to look at logs only when things go wrong. We find out there's an issue by receiving alerts based on metrics. Only after that, we might explore logs to deduce the problem. Even in those cases, logs rarely provide value alone. We can have meaningful insights only when logs are combined with metrics.
In those dire times, when things do go wrong, we need all the help we can get to investigate the problem. Luckily, Grafana’s Explore mode allows us to create a split view. We can, for example, combine the results from querying logs with those coming from metrics. Let’s try it out.
Just like before, type the query that follows and press the "Shift" and "Enter" keys to execute it.
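Based on the description that follows, the query should look roughly like this (the job name is taken from the workload mentioned in the next paragraph):

```
{job="production/go-demo-9-go-demo-9"} |= "ERROR"
```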
That query returns logs from Loki. It contains only the entries that include the word ERROR and are associated with go-demo-9-go-demo-9 running in the production namespace.
Next, we’ll create a second view.
Press the "Split" button in the top-right corner of the page and select Prometheus as the source for that panel (it's currently set to Loki). Type the query that follows into the second view and press the "Shift" and "Enter" keys to execute it.
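The metric name depends entirely on how the application is instrumented, so treat the snippet that follows as a hypothetical example (`http_server_resp_time_count` and the `path` label are assumptions). A per-path response rate could look like:

```
sum(rate(http_server_resp_time_count[5m])) by (path)
```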
That query returns metrics from Prometheus. It returns the response rate grouped by the path.
Those two queries might not be directly related. However, they demonstrate the ability to correlate different queries that can be even from various sources. Through those two, we might be able to deduce whether there's a relation between the errors from a specific application and the responses from the whole cluster.
Note: Don't take this as a suggestion that these are the most useful queries. They're not. They're just a demonstration that multiple queries from different sources can be presented together.
The output of both queries is likely empty or very sparse. That’s normal because our demo app isn't receiving any traffic. We’ll change that soon.
Load Testing#
Since we're trying to correlate request metrics with errors recorded in logs, we should generate some traffic to make the correlation more meaningful.
We'll use Siege to storm our application with requests. We’ll do it twice: once with “normal” requests, and once against the endpoint that produces errors.
First, let’s get the baseline address with the path pointing to /demo/hello. That one never returns errors.
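How the address is retrieved depends on how the application is exposed. One possibility, assuming the demo app is exposed through an Ingress named go-demo-9 in the production namespace (all names here are assumptions):

```shell
# Hypothetical: fetch the Ingress host of the demo app (names are assumptions).
export BASE_ADDR=$(kubectl --namespace production \
    get ingress go-demo-9 \
    --output jsonpath="{.spec.rules[0].host}")

# The baseline address that never returns errors.
echo "http://$BASE_ADDR/demo/hello"
```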
Now we can run Siege.
We’ll be sending 10 concurrent requests for 30 seconds.
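A sketch of such a run as a one-shot Pod, using a publicly available Siege image (the image and the $BASE_ADDR variable from the previous step are assumptions):

```shell
# Hypothetical one-shot Pod running Siege; removed automatically when done.
kubectl run siege \
    --image yokogawa/siege \
    --restart Never -it --rm \
    -- --concurrent 10 --time 30S \
    "http://$BASE_ADDR/demo/hello"
```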
The output is not that important. What matters is that we sent hundreds of requests to the endpoint that should be responding with the status code 200. The availability should be 100%, or slightly lower.
After the Pod is done, go back to Grafana and re-run both queries by clicking the buttons with blue circled arrows or selecting the fields with the queries and pressing the "Shift" and "Enter" keys.
Prometheus should draw a spike, while Loki will show nothing. That's expected because the Loki query only displays entries with errors, and we didn't generate any.
Let’s repeat Siege but this time with the path /demo/random-error. It generates random errors.
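Assuming the same setup as in the previous run, only the path changes:

```shell
# Hypothetical: same Siege run, pointed at the error-producing endpoint.
kubectl run siege \
    --image yokogawa/siege \
    --restart Never -it --rm \
    -- --concurrent 10 --time 30S \
    "http://$BASE_ADDR/demo/random-error"
```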
And again, after the Pod is finished executing, let's go to Grafana and refresh both panels.
As we can see, the Prometheus graph got a new group of entries for the new path, and Loki returned Something, somewhere, went wrong! messages.
If we re-run Siege by sending requests to /demo/random-error several times, we'll clearly see the correlation between red bars in Loki’s graph and request rate spikes to a specific path in the Prometheus panel.
As you can see, Loki is a log aggregation solution with a low administration footprint and tight integration with other observability tools like Grafana and Prometheus.
A lot of capabilities are left to explore. We can graph metric queries and set thresholds and alerts in Grafana. If our application is instrumented (if it exposes internal metrics), we can combine them with the logs. We could even add request tracing and teach Loki to parse trace IDs and highlight them as links to a UI specialized in tracing.
Destroying the resources#
We're done with this chapter, so let’s clean up all the tools we installed. There’s probably no need to comment on what we’ll do since it’s similar to what we did at the end of all the previous sections. We’ll just do it.
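The exact commands depend on how the tools were installed. Assuming Helm releases in a monitoring namespace (release and namespace names are assumptions), the cleanup might look like this:

```shell
# Hypothetical cleanup; adjust release and namespace names to your setup.
helm --namespace monitoring uninstall loki
helm --namespace monitoring uninstall grafana
helm --namespace monitoring uninstall prometheus
kubectl delete namespace monitoring
```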
Note: We'll get hands-on experience with the concepts and commands discussed in this lesson in the project "Hands-on: Using Centralized Logging" right after this chapter.
Playing with the Loki Stack
Quiz: Centralized Logging